RAPToR - Building References

This vignette is specifically focused on building the references needed to stage samples with RAPToR. For more general usage of the package, see the "RAPToR" vignette.
Building references is one of the key aspects of RAPToR. An appropriate reference is needed to stage samples, and since they are re-usable we believe they are worth the trouble to set up.
Throughout this vignette, you will see the general workflow of building a reference, from selecting an appropriate dataset to choosing and validating a model for interpolation. Alongside the explanations, examples are given using the same ds1 and ds2 datasets as in the general usage vignette (you can see how these were built at the end of this document).
Finally, a few more examples of reference building on different datasets will be included at the end of this document.
We hope this material will be sufficient for your reference-building needs.
Without a transcriptomic time-series dataset spanning the developmental stages of your samples' organism, there is unfortunately not much we can do for you! Thankfully, these time-series experiments are increasingly numerous in the literature, and most model organisms will have some kind of dataset we can use. You may also have your own time-series data on hand.
There are a few databases you can download datasets from, the most well-known being the Gene Expression Omnibus (GEO) and ArrayExpress.
Both of these databases have APIs to fetch their data from R (e.g. the GEOquery package, as shown in the example dataset loading scripts).
Several points of the experimental design should be kept in mind when selecting a dataset for a reference.
One of the plagues of bioinformatics is the large and fast-changing set of IDs for genes, transcripts, etc. When you build a reference, you should always convert it to IDs that are conventional and stable. We like to use organism-specific IDs (e.g. WormBase IDs for C. elegans: WBGene00016153; FlyBase IDs for Drosophila: FBgn0010583).
The Ensembl BioMart, or its associated R package biomaRt, is a very useful resource for getting gene, transcript, or probe ID lists for conversion.
Below is a code snippet getting gene IDs for Drosophila with biomaRt.
requireNamespace("biomaRt", quietly = TRUE)
# setup connection to ensembl
mart <- biomaRt::useMart("ensembl", dataset = "dmelanogaster_gene_ensembl")
# get list of attributes
droso_genes <- biomaRt::getBM(attributes = c("ensembl_gene_id",
"ensembl_transcript_id",
"external_gene_name",
"flybase_gene_id"),
mart = mart)
head(droso_genes)
#> ensembl_gene_id ensembl_transcript_id external_gene_name flybase_gene_id
#> 1 FBgn0053882 FBtr0091886 His2B:CG33882 FBgn0053882
#> 2 FBgn0035648 FBtr0113148 CG13288 FBgn0035648
#> 3 FBgn0035648 FBtr0331563 CG13288 FBgn0035648
#> 4 FBgn0032726 FBtr0342927 CG10621 FBgn0032726
#> 5 FBgn0032726 FBtr0342926 CG10621 FBgn0032726
#> 6 FBgn0032726 FBtr0081191 CG10621 FBgn0032726
When multiple probe or transcript IDs match a single gene ID, we usually mean-aggregate the expression values. This is handled by the format_ids() function.
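As a minimal sketch of this aggregation, consider a toy probe-level matrix (the probe_id/gene_id column names and values below are invented for illustration; in practice you would pass a real ID table such as wormRef::Cel_genes):

```r
# toy expression matrix: 3 probes x 2 samples
X <- matrix(1:6, nrow = 3,
            dimnames = list(c("p1", "p2", "p3"), c("s1", "s2")))
# toy ID table: p1 and p2 map to the same gene
ids <- data.frame(probe_id = c("p1", "p2", "p3"),
                  gene_id  = c("gA", "gA", "gB"))
# rows sharing a gene ID are mean-aggregated
Xg <- RAPToR::format_ids(X, ids, from = "probe_id", to = "gene_id")
```

Here the rows for p1 and p2 would be averaged into a single gA row.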
It’s common practice to normalize expression datasets (e.g. to account for technical bias). You may deal with many different profiling technologies when building references, and may join multiple datasets together for a reference.
To stay as consistent as possible, we apply quantile normalization to our datasets regardless of their source. For this, we use the normalizeBetweenArrays() function of the limma package.
We also apply a \(\log(X + 1)\) transformation.
ds1$g <- limma::normalizeBetweenArrays(ds1$g, method = "quantile")
ds1$g <- log(ds1$g + 1)
ds2$g <- limma::normalizeBetweenArrays(ds2$g, method = "quantile")
ds2$g <- log(ds2$g + 1)
It’s usually a good practice to take a look at what’s inside your data before doing anything else to it.
ds1$g[1:5, 1:4]
#> let.7.n2853._18hr let.7.n2853._20hr let.7.n2853._22hr let.7.n2853._24hr
#> WBGene00000001 2.619273 2.530099 2.528902 2.518151
#> WBGene00000002 2.204660 2.527502 2.095510 1.922935
#> WBGene00000003 2.210836 2.005761 2.105462 2.291663
#> WBGene00000004 2.481065 2.335821 2.190166 2.217177
#> WBGene00000005 1.586324 2.152875 1.498992 1.135504
head(ds1$p, n = 5)
#> title geo_accession organism_ch1 strain
#> GSM2113587 let.7.n2853._18hr GSM2113587 Caenorhabditis elegans let-7(n2853)
#> GSM2113588 let.7.n2853._20hr GSM2113588 Caenorhabditis elegans let-7(n2853)
#> GSM2113589 let.7.n2853._22hr GSM2113589 Caenorhabditis elegans let-7(n2853)
#> GSM2113590 let.7.n2853._24hr GSM2113590 Caenorhabditis elegans let-7(n2853)
#> GSM2113591 let.7.n2853._26hr GSM2113591 Caenorhabditis elegans let-7(n2853)
#> time in development:ch1 age
#> GSM2113587 18 hours 18
#> GSM2113588 20 hours 20
#> GSM2113589 22 hours 22
#> GSM2113590 24 hours 24
#> GSM2113591 26 hours 26
With our time-series data, we can look at correlation heatmaps or boxplots of the samples to catch potential outliers and observe the clear correlation between samples at similar stages of development.
cor_ds1 <- cor(ds1$g, method = "spearman")
# Heatmap
ord <- order(ds1$p$age)
heatmap(cor_ds1[ord, ord], Colv = NA, Rowv = NA, scale = "none", keep.dendro = F, margins = c(1,1),
RowSideColors = as.numeric(ds1$p$strain[ord]), labRow = "", labCol = "")
par(xpd = T) # text may have to be tweaked to plot size
mtext(text = unique(ds1$p$age), side = 1, line = 4, at = seq(-.1,1.05, l = 11))
# Boxplot
boxplot(cor_ds1 ~ interaction(ds1$p$strain, ds1$p$age), col = 1:4, xaxt = "n",
ylab = "Spearman correlation", xlab = "age", at = seq(1,44, l = 55)[c(T,T,T,T,F)])
axis(side = 1, at = seq(2,42, l = 11), labels = unique(ds1$p$age))
legend(23,.86, fill = 1:4, legend = c("let-7", "lin-41", "let-7/lin-41", "N2"), bty = "n")
You can also plot components (PCA or ICA) with respect to time, to get an idea of the dynamics.
pca_ds1 <- stats::prcomp(ds1$g, rank = 25)
par(mfrow = c(2,4))
invisible(
sapply(seq_len(8), function(i){
plot(ds1$p$age, pca_ds1$rotation[,i], lwd = 2, col = ds1$p$strain,
xlab = "age", ylab = "PC", main = paste0("PC", i))
# connect the dots
sapply(seq_along(levels(ds1$p$strain)), function(l){
s <- which(ds1$p$strain == levels(ds1$p$strain)[l])
points(ds1$p$age[s], pca_ds1$rotation[s,i], col = l,
type = 'l', lty = 2)
})
if(i == 1)
legend("topleft", bty = 'n', legend = c("let-7", "lin-41", "let-7/lin-41", "N2"),
pch = c(rep(1, 4)), lty = c(rep(NA, 4)), col = c(1:4), lwd = 3)
})
)
In this data, we see very consistent dynamics across the different strains. PC3 and PC4 also capture oscillatory dynamics characteristic of the molting process of C. elegans larval development.
Another approach is to look at a few random genes, which gives you a first-hand look at the noise in your data.
set.seed(3)
gtp <- sample(nrow(ds1$g), size = 4)
par(mfrow = c(1,4))
invisible(
sapply(gtp, function(i){
plot(ds1$p$age, ds1$g[i,], lwd = 2, col = ds1$p$strain,
xlab = "age", ylab = "GExpr", main = rownames(ds1$g)[i])
# connect the dots
sapply(seq_along(levels(ds1$p$strain)), function(l){
s <- which(ds1$p$strain == levels(ds1$p$strain)[l])
points(ds1$p$age[s], ds1$g[i,s], col = l,
type = 'l', lty = 2)
})
if(i == gtp[1])
legend("topright", bty = 'n', legend = c("let-7", "lin-41", "let-7/lin-41", "N2"),
pch = c(rep(1, 4)), lty = c(rep(NA, 4)), col = c(1:4), lwd = 3)
})
)
To increase the resolution of our time series, we face a very unbalanced regression problem: we essentially want to predict tens of thousands of dependent variables (genes) from a few independent variables (time, batch, …).
We refer to a gene expression interpolation model as a GEIM in the following text.
The principal strategy we put forward for predicting such a large set of output variables is to interpolate in a dimensionally reduced space. We propose to do this on principal components (PCA) or independent components (Independent Component Analysis, ICA).
Both PCA and ICA perform the same type of linear transformation on the data; they just maximize a different criterion. PCA maximizes the variance of each component, and ICA their independence. We get the following:
\[ X_{(m\times n)} = G_{(m\times c)}S^{T}_{(n\times c)} \] with \(X\), the matrix of \(m\) genes by \(n\) samples, \(G\) the gene loadings (\(m\) genes by \(c\) components) and \(S^T\) the sample loadings (\(n\) samples by \(c\) components). \(S\) is what’s usually looked at when performing a PCA (or ICA) on gene expression data, to look at the samples in the component space. It’s what we plotted in the section on observing data, for instance.
Alter, Brown, and Botstein (2000) previously demonstrated that the singular value decomposition of gene expression data can be interpreted as “eigengenes” giving a global picture of the dynamics of gene expression. We use essentially the same property for a GEIM: we build a model on the columns of \(S^T\) (eigengenes), predict in the component space, and reconstruct the gene expression data by a matrix product with the gene loadings.
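The decomposition and reconstruction steps can be sketched with base R on the ds1 PCA (a simplified illustration of the principle, to the best of our understanding of what ge_im() does internally):

```r
nc <- 24
pca <- stats::prcomp(ds1$g, rank = nc)
G  <- pca$x         # gene loadings: m genes x nc components
St <- pca$rotation  # sample loadings (eigengenes): n samples x nc components
# a GEIM fits one model per column of St; predicting at new timepoints
# yields new sample loadings, and expression is rebuilt by matrix product:
X_hat <- G %*% t(St)  # with St itself, this approximates the centered input
```

With predicted (interpolated) sample loadings in place of St, the same product yields the interpolated expression matrix.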
We’ve implemented two model types: Generalized Additive Models (GAMs, the default) and Generalized Linear Models (GLMs). GAMs rely on the gam() function of the mgcv package, and GLMs on the glm() function of the stats core package.
As you’ll see in the next section, a standard R formula is specified for the model. This formula can make use of all the tools available with gam() or glm(), most notably the variety of polynomial or smoothing splines implemented through the s() function for GAMs. With GLMs, you can also use splines from the splines core package, such as ns() for natural cubic splines.
GEIMs are built with the ge_im() function, which outputs a geim object. This function takes three key arguments:
X: the gene expression matrix of your time series (genes as rows, samples as columns).
p: a data frame of pheno data, samples as rows. This should include the age/time variable and any other covariates you want to include in the model (e.g. batch, strain).
formula: the model formula. This should be a standard R formula using terms found in p, and should start with X ~.
Another important argument is nc, the number of components used for the interpolation.
For example, using the ds1 dataset we could build the following model.
m_ds1 <- ge_im(X = ds1$g, p = ds1$p, formula = "X ~ s(age, bs = 'cr') + strain", nc = 24)
Note that a single model formula is specified and applied to all components, but the models are fitted independently on each component.
Feel free to have a look at the function documentation (?ge_im) for additional parameters.
To get model predictions, you simply use the predict() function, like for any standard R model. Note that predictions correspond to a scaled version of X because of the dimension reduction procedure.
There are 5 types of GEIM you can fit with the ge_im() function:
(method = "gam", dim_red = "pca") (default)
(method = "glm", dim_red = "pca")
(method = "gam", dim_red = "ica")
(method = "glm", dim_red = "ica")
(method = "limma")
The default option is a robust choice when applying a smoothing spline to the data.
Using PCA or ICA components yields near-identical results in most scenarios. ICA tends to outperform PCA when the data is very noisy. This is by design since ICA essentially performs signal extraction. It is however slower, especially if nc is large.
The last option ("limma") corresponds to a solution that makes no effort to reduce the dimensionality of the problem (the dim_red and nc arguments are ignored). As a result, there is no information loss or bias introduced by dimension reduction. This approach is however very sensitive to noise. The model is fit with the lmFit() function of the limma package (hence the name).
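A limma-based GEIM might be fit as in the hedged sketch below. Since lmFit() fits linear models, we assume the formula must use GLM-compatible terms such as ns() from the splines package rather than mgcv's s(); the df value is purely illustrative:

```r
requireNamespace("splines", quietly = TRUE)
m_lm <- ge_im(X = ds1$g, p = ds1$p,
              formula = "X ~ splines::ns(age, df = 5) + strain",
              method = "limma")  # dim_red and nc are ignored with this method
```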
Note that when using GAMs, there is by definition no interaction possible between terms. It is possible to pass a by argument to the s() function, but this essentially corresponds to separate fits on the levels of the specified variable.
One can use a number of criteria to evaluate model performance. We provide the mperf() function to compute the indices described below from the data and the model predictions.
In the formulas below, \(X\) corresponds to the input gene expression matrix (\(m\) genes as rows, \(n\) samples as columns) and \(\hat{X}\) to the model predictions. \(x_i\) corresponds to row \(i\) of matrix \(X\) and \(x_i^{(j)}\) to sample \(j\) of that row. This notation is derived from the general regression problem, where \(X^T\) corresponds to the set of \(m\) dependent variables to predict.
aCC : average Correlation Coefficient.\[ aCC = \frac{1}{m}\sum^{m}_{i=1}{CC_i} = \frac{1}{m}\sum^{m}_{i=1}{\cfrac{\sum^{n}_{j=1}{(x_i^{(j)}-\bar{x}_i)(\hat{x}_i^{(j)}-\bar{\hat{x}}_i)}}{\sqrt{\sum^{n}_{j=1}{(x_i^{(j)}-\bar{x}_i)^2}}\sqrt{\sum^{n}_{j=1}{(\hat{x}_i^{(j)}-\bar{\hat{x}}_i)^2}}}} \]
aRE : average Relative Error.\[ aRE = \frac{1}{m}\sum^{m}_{i=1}{RE_i} = \frac{1}{m} \sum^{m}_{i=1} \frac{1}{n} \sum^{n}_{j=1} \cfrac{| x_i^{(j)} - \hat{x}_i^{(j)} | }{x_i^{(j)}} \]
MSE : Mean Squared Error.\[ MSE = \frac{1}{m} \sum^{m}_{i=1} \frac{1}{n} \sum^{n}_{j=1} (x_i^{(j)} - \hat{x}_i^{(j)} )^2 \]
aRMSE : average Root MSE.\[ aRMSE = \frac{1}{m}\sum^{m}_{i=1}{RMSE} = \frac{1}{m} \sum^{m}_{i=1} \sqrt{\cfrac{\sum^{n}_{j=1} (x_i^{(j)} - \hat{x}_i^{(j)} )^2}{n}} \]
Note that these indices are computed and averaged with respect to variables, not observations. You can either get the overall (averaged) index or the value per gene with the global parameter.
g_mp <- mperf(scale(ds1$g), predict(m_ds1), is.t = T)
g_mp
#> $aCC
#> [1] 0.817616
#>
#> $aRE
#> [1] 0.008558632
#>
#> $MSE
#> [1] 0.004256494
#>
#> $aRMSE
#> [1] 0.06524181
ng_mp <- mperf(scale(ds1$g), predict(m_ds1), is.t = T, global = F)
It’s possible to check the model performance by looking at the index distributions over all genes, e.g. :
par(mfrow = c(2,2))
invisible(
sapply(names(ng_mp), function(idx){
rg <- range(na.omit(ng_mp[[idx]]))
# label position
if(idx == "aCC"){
pos <- 2
} else {
pos <- 4
}
# estimate density curve
d <- density(na.omit(ng_mp[[idx]]), from = rg[1], to = rg[2])
plot(d, main = paste0(gsub("a", "", idx, fixed = T), " density (", length(ng_mp[[idx]]), " genes)"),
xlab = idx, lwd = 2)
# display global value
abline(v = g_mp[[idx]], lty = 2, lwd = 2, col = "firebrick")
text(g_mp[[idx]], .9*max(d$y), pos = pos, labels = idx, font = 2, col = "firebrick")
})
)
The number of components used for the interpolation defaults to the number of samples. However, we recommend setting a cutoff on the explained variance of PCA components to select it. You can do this (on PCA explained variance) even if you choose ICA for the interpolation.
For example, on the ds1 dataset, we set the threshold at \(99.9\%\):
nc <- sum(summary(pca_ds1)$importance[3,] < .999) + 1
nc
#> [1] 24
Note that this threshold must be set in accordance with the noise in the data: in very noisy data, would you consider that \(99.9\%\) of the variance corresponds to meaningful dynamics?
You can also define nc by plotting your components and stopping when they no longer capture meaningful variation (dynamics) with respect to time/age.
In very noisy datasets, you may have to keep very few components (\(<5\)) for the interpolation.
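A quick way to eyeball the cutoff is to plot the cumulative explained variance of the PCA components against the chosen threshold:

```r
cumvar <- summary(pca_ds1)$importance[3, ]  # cumulative proportion of variance
plot(cumvar, type = "b", xlab = "Component",
     ylab = "Cumulative proportion of variance")
abline(h = 0.999, lty = 2, col = "firebrick")  # the 99.9% cutoff used above
```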
Choosing between different splines (and/or parameters) can be done by cross-validation (CV) with the ge_imCV() function. The function takes X, p, and a formula_list to test; other parameters of the CV itself can also be given (e.g. training set size).
The default training/validation split is cv.s = 0.8, so \(80\%\) of the data is used to build the model. When the model includes (factor) covariates, the training set is built such that all groups are proportionately represented (based on the terms of the first formula in the list). The number of CV repeats is set with cv.n.
Note that the model type (GAM/GLM, PCA/ICA) is fixed for all formulas within one call of ge_imCV().
ge_imCV() computes the mperf() indices of model performance, excluding aCC due to computing time. Indices are computed on the validation set (CV error) and on the training set (model performance).
Below is an example choosing between four GAM smooth types on the ds1 GEIM.
smooth_methods <- c("tp", "ts", "cr", "ps")
flist <- as.list(paste0("X ~ s(age, bs = \'", smooth_methods, "\') + strain"))
flist
#> [[1]]
#> [1] "X ~ s(age, bs = 'tp') + strain"
#>
#> [[2]]
#> [1] "X ~ s(age, bs = 'ts') + strain"
#>
#> [[3]]
#> [1] "X ~ s(age, bs = 'cr') + strain"
#>
#> [[4]]
#> [1] "X ~ s(age, bs = 'ps') + strain"
cv_ds1 <- ge_imCV(X = scale(ds1$g), p = ds1$p, formula_list = flist,
cv.n = 20, nc = nc, nb.cores = 3)
#> CV on 4 models. cv.n = 20 | cv.s = 0.8
#>
#> ...Building training sets
#> ...Setting up cluster
#> ...Running CV
#> ...Cleanup and formatting
plot(cv_ds1, names = paste0("bs = ", smooth_methods), outline = F,
swarmargs = list(cex = .8))
From the plots above, the cubic regression spline (cr) seems to be the best-performing choice.
You can also specify extra spline parameters for a fit. With s(), you can give the number of knots to use via the k parameter. By default, the spline is penalized, so it will not necessarily use k knots, but will stay around that value. In our experience this is rarely necessary, as the parameter estimation done by gam() is usually sufficient.
To force a spline with exactly k knots, you must also set fx = TRUE (note this fits a spline with k knots on all components, whereas the penalized spline adjusts per component). See the s() or choose.k documentation for further information.
Below is an example with the ds1 data.
ks <- c(4,6,8,10)
flistk <- as.list(c(
"X ~ s(age, bs = 'cr') + strain",
paste0("X ~ s(age, bs = 'cr', k = ", ks , ") + strain"),
paste0("X ~ s(age, bs = 'cr', k = ", ks , ", fx=TRUE) + strain")
))
flistk
#> [[1]]
#> [1] "X ~ s(age, bs = 'cr') + strain"
#>
#> [[2]]
#> [1] "X ~ s(age, bs = 'cr', k = 4) + strain"
#>
#> [[3]]
#> [1] "X ~ s(age, bs = 'cr', k = 6) + strain"
#>
#> [[4]]
#> [1] "X ~ s(age, bs = 'cr', k = 8) + strain"
#>
#> [[5]]
#> [1] "X ~ s(age, bs = 'cr', k = 10) + strain"
#>
#> [[6]]
#> [1] "X ~ s(age, bs = 'cr', k = 4, fx=TRUE) + strain"
#>
#> [[7]]
#> [1] "X ~ s(age, bs = 'cr', k = 6, fx=TRUE) + strain"
#>
#> [[8]]
#> [1] "X ~ s(age, bs = 'cr', k = 8, fx=TRUE) + strain"
#>
#> [[9]]
#> [1] "X ~ s(age, bs = 'cr', k = 10, fx=TRUE) + strain"
cv_ds1k <- ge_imCV(X = scale(ds1$g), p = ds1$p, formula_list = flistk,
cv.n = 20, nc = nc, nb.cores = 3)
#> CV on 9 models. cv.n = 20 | cv.s = 0.8
#>
#> ...Building training sets
#> ...Setting up cluster
#> ...Running CV
#> ...Cleanup and formatting
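As with the previous CV run, the results can be displayed with the plot method; the names vector below is simply our suggested labeling for the 9 formulas, in list order:

```r
plot(cv_ds1k,
     names = c("default", paste0("k=", ks), paste0("k=", ks, " fx")),
     outline = FALSE, swarmargs = list(cex = .8))
```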
The entire reason we need GEIMs is to interpolate: we want a higher-resolution time series to stage samples with.
To do this, as for any R model, we use the predict() function. We first need to set up the new data to predict from.
n.inter <- 100 # nb of new timepoints
newdat <- data.frame(
age = seq(min(ds1$p$age), max(ds1$p$age), l = n.inter),
strain = rep("N2", n.inter) # we want to predict as N2
)
head(newdat)
#> age strain
#> 1 18.00000 N2
#> 2 18.20202 N2
#> 3 18.40404 N2
#> 4 18.60606 N2
#> 5 18.80808 N2
#> 6 19.01010 N2
# predict
pred_m_ds1 <- predict(m_ds1, newdata = newdat)
You can choose to output the predictions directly as the full gene expression matrix (default) or as components. Predictions as components with the as.c = TRUE option can be useful when checking the interpolation (see the next section).
pred_m_ds1_comp <- predict(m_ds1, newdata = newdat, as.c = TRUE)
We have built our model and predicted new data; now we can have a look at the interpolation results, in two ways.
Checking model predictions against components lets you see immediately if some dynamics are mishandled by the model.
Don’t hope for perfect fits! Slight offsets are perfectly acceptable. You may also notice that some noisy components get “flattened”, with a null model fitted. These have little impact on the results, as they usually correspond to a minuscule part of the total variance; they sometimes even remove some unwanted variation.
On the ds1 dataset, the predictions of the first few components are plotted below.
par(mfrow = c(2,4))
invisible(
sapply(seq_len(8), function(i){
plot(ds1$p$age, pca_ds1$rotation[,i], lwd = 2, col = ds1$p$strain,
xlab = "age", ylab = "PC", main = paste0("PC", i))
# connect the dots
sapply(seq_along(levels(ds1$p$strain)), function(l){
s <- which(ds1$p$strain == levels(ds1$p$strain)[l])
points(ds1$p$age[s], pca_ds1$rotation[s,i], col = l,
type = 'l', lty = 2)
})
# add model prediction
points(newdat$age, pred_m_ds1_comp[, i], col = "royalblue", type = 'l', lwd = 3)
if(i == 1)
legend("topleft", bty = 'n', legend = c("let-7", "lin-41", "let-7/lin-41", "N2", "pred"),
pch = c(rep(1, 4), NA), lty = c(rep(NA, 4), 1), col = c(1:4, "royalblue"), lwd = c(rep(3,4),4))
})
)
The interpolation should translate well back onto the (scaled) full expression matrix:
par(mfrow = c(1,4))
sX <- scale(ds1$g) # scale X
invisible(
sapply(gtp, function(i){ # gtp is from the earlier gene plots
plot(ds1$p$age, sX[i,], lwd = 2, col = ds1$p$strain,
xlab = "age", ylab = "Scaled GExpr", main = rownames(ds1$g)[i])
# connect the dots
sapply(seq_along(levels(ds1$p$strain)), function(l){
s <- which(ds1$p$strain == levels(ds1$p$strain)[l])
points(ds1$p$age[s], sX[i,s], col = l,
type = 'l', lty = 2)
})
# add model prediction
points(newdat$age, pred_m_ds1[i, ], col = "royalblue", type = 'l', lwd = 3)
if(i == 1)
legend("topleft", bty = 'n', legend = c("let-7", "lin-41", "let-7/lin-41", "N2"),
pch = c(rep(1, 4)), lty = c(rep(NA, 4)), col = c(1:4), lwd = 3)
})
)
Testing your reference for its final purpose is good practice.
A first test is to stage the samples used to build the reference on their own interpolated reference.
Finally, the best validation is to stage another time series from the literature on your reference, if such data exists. This provides external validation, confirming that the dynamics you interpolated correspond to developmental processes and not to an effect specific to the dataset (which is unlikely, but not impossible).
We can do this with our example: the ds2 dataset can be used for external validation.
# make a 'reference object'
r_ds1 <- list(interpGE = pred_m_ds1, time.series = newdat$age)
ae_test_ds1 <- ae(ds1$g, r_ds1$interpGE, r_ds1$time.series)
ae_test_ds2 <- ae(ds2$g, r_ds1$interpGE, r_ds1$time.series)
par(mfrow = c(1,2))
rg <- range(c(ae_test_ds1$age.estimates[,1], ds1$p$age))
# Plot 1
plot(ae_test_ds1$age.estimates[,1]~ds1$p$age,
xlab = "Chronological age", ylab = "Estimated age (ds1)",
xlim = rg, ylim = rg,
main = "Chron. vs Estimated ages for ds1 (on ds1 reference)", lwd = 2,
col = factor(ds1$p$strain))
# connect the dots
invisible(sapply(levels(factor(ds1$p$strain)), function(l){
s <- ds1$p$strain == l
points(ae_test_ds1$age.estimates[s,1]~ds1$p$age[s], type = 'l',
lty = 2, col = which(l==levels(factor(ds1$p$strain))))
}))
abline(a = 0, b = 1, lty = 3, lwd = 2) # x = y
legend("bottomright", legend = c("let-7", "lin-41", "let-7/lin-41", "N2", "x = y"),
lwd=3, col=c(1:4, 1), bty='n', pch = c(1,1,1,1,NA), lty = c(rep(NA, 4), 3))
# Plot 2
rg <- range(c(ae_test_ds2$age.estimates[,1], ds2$p$age))
plot(ae_test_ds2$age.estimates[,1]~ds2$p$age,
xlab = "Chronological age", ylab = "Estimated age (ds1)",
xlim = rg, ylim = rg,
main = "Chron. vs Estimated ages for ds2 (on ds1 reference)", lwd = 2)
# connect the dots
points(ae_test_ds2$age.estimates[,1] ~ ds2$p$age, type = 'l', lty = 2)
abline(a = 0, b = 1, lty = 3, lwd = 2) # x = y
legend("bottomright", legend = "x = y", lwd=3, col=1, lty = 3, bty='n')
Here you will find a few examples of reference building on different organisms.
The datasets in this first example are those used for the in-text examples throughout this vignette.
We are using two C. elegans time-series datasets:
ds1. This is the dataset used to build the reference. (Accession: GSE80157)
ds2. This is the dataset used for external validation. (Accession: GSE52861)
Code to generate ds1 and ds2:
Note: set the data_folder variable to an existing path on your system where you want to store the objects.
data_folder <- "../inst/extdata/"
requireNamespace("wormRef", quietly = T)
requireNamespace("utils", quietly = T)
requireNamespace("GEOquery", quietly = T) # May need to be installed with bioconductor
requireNamespace("Biobase", quietly = T)
raw2rpkm <- function(X, gene.length, id.col = 1, l.col='length'){
# Compute RPKM from raw counts
if(!all(rownames(X)%in%gene.length[, id.col])){
stop("Some genes are missing length info !")
}
res <- sapply(colnames(X), function(samp){
pm <- sum(X[,samp])/1e6
rpkm <- (X[,samp]/pm)/(gene.length[match(rownames(X), gene.length[, id.col]), l.col]/1000)
})
rownames(res) <- rownames(X)
return(res)
}
ds1
geo_ds1 <- "GSE80157"
g_url_ds1 <- GEOquery::getGEOSuppFiles(geo_ds1, makeDirectory = FALSE, fetch_files = FALSE)
g_file_ds1 <- paste0(data_folder, "ds1.txt.gz")
utils::download.file(url = as.character(g_url_ds1$url[2]), destfile = g_file_ds1)
X_ds1 <- read.table(gzfile(g_file_ds1), h=T, sep = '\t', stringsAsFactors = F, row.names = 1)
# convert to rpkm & wb_id
X_ds1 <- format_ids(X_ds1, wormRef::Cel_genes, from = "wb_id", to = "wb_id")
X_ds1 <- raw2rpkm(X = X_ds1, gene.length = wormRef::Cel_genes, id.col = "wb_id", l.col = "transcript_length")
# pheno data
P_ds1 <- Biobase::pData(GEOquery::getGEO(geo_ds1, getGPL = F)[[1]])
P_ds1[,10:34] <- NULL
P_ds1[, 3:8] <- NULL
colnames(P_ds1)[4] <- "strain"
P_ds1$strain <- factor(P_ds1$strain)
P_ds1$title <- gsub('[-\\(\\);]', '.', P_ds1$title)
colnames(X_ds1) <- gsub('RNASeq_riboM_', '', colnames(X_ds1), fixed = T)
P_ds1$title <- gsub('RNASeq_riboM_', '', P_ds1$title, fixed = T)
# get age
P_ds1$age <- as.numeric(sub('(\\d+)\\shours', '\\1', P_ds1$`time in development:ch1`))
X_ds1 <- X_ds1[, P_ds1$title]
ds1 <- list(g = X_ds1, p = P_ds1)
save(ds1, file = paste0(data_folder, "ds1.RData"), compress = "xz")
# cleanup
file.remove(g_file_ds1)
rm(geo_ds1, g_url_ds1, g_file_ds1, X_ds1, P_ds1)
ds2
geo_ds2 <- "GSE52861"
g_url_ds2 <- GEOquery::getGEOSuppFiles(geo_ds2, makeDirectory = FALSE, fetch_files = FALSE)
g_file_ds2 <- paste0(data_folder, "ds2.txt.gz")
utils::download.file(url = as.character(g_url_ds2$url[2]), destfile = g_file_ds2)
X_ds2 <- read.table(gzfile(g_file_ds2), h=T, sep = '\t', stringsAsFactors = F, row.names = 1)
# convert to rpkm & wb_id
X_ds2 <- format_ids(X_ds2, wormRef::Cel_genes, from = "wb_id", to = "wb_id")
X_ds2 <- raw2rpkm(X = X_ds2, gene.length = wormRef::Cel_genes, id.col = "wb_id", l.col = "transcript_length")
# pheno data
P_ds2 <- Biobase::pData(GEOquery::getGEO(geo_ds2, getGPL = F)[[1]])
# filter relevant fields/samples
P_ds2 <- P_ds2[(P_ds2$`strain:ch1` == 'N2') & (P_ds2$`growth protocol:ch1` == 'Continuous'), ]
P_ds2 <- P_ds2[, c("title", "geo_accession", "time in development:ch1")]
# get age
P_ds2$age <- as.numeric(sub('(\\d+)\\shours', '\\1', P_ds2$`time in development:ch1`))
# formatting
P_ds2$title <- gsub('RNASeq_polyA_', '',
gsub('hr', 'h',
gsub('-', '.', fixed = T, as.character(P_ds2$title))))
colnames(X_ds2) <- gsub('RNASeq_polyA_','', colnames(X_ds2))
X_ds2 <- X_ds2[, P_ds2$title]
ds2 <- list(g = X_ds2, p = P_ds2)
save(ds2, file = paste0(data_folder, "ds2.RData"), compress = "xz")
# cleanup
file.remove(g_file_ds2)
rm(geo_ds2, g_url_ds2, g_file_ds2, X_ds2, P_ds2)
ds1$g <- limma::normalizeBetweenArrays(ds1$g, method = "quantile")
ds1$g <- log(ds1$g + 1)
ds2$g <- limma::normalizeBetweenArrays(ds2$g, method = "quantile")
ds2$g <- log(ds2$g + 1)
ds1$g[1:5, 1:4]
#> let.7.n2853._18hr let.7.n2853._20hr let.7.n2853._22hr let.7.n2853._24hr
#> WBGene00000001 2.619273 2.530099 2.528902 2.518151
#> WBGene00000002 2.204660 2.527502 2.095510 1.922935
#> WBGene00000003 2.210836 2.005761 2.105462 2.291663
#> WBGene00000004 2.481065 2.335821 2.190166 2.217177
#> WBGene00000005 1.586324 2.152875 1.498992 1.135504
head(ds1$p, n = 5)
#> title geo_accession organism_ch1 strain
#> GSM2113587 let.7.n2853._18hr GSM2113587 Caenorhabditis elegans let-7(n2853)
#> GSM2113588 let.7.n2853._20hr GSM2113588 Caenorhabditis elegans let-7(n2853)
#> GSM2113589 let.7.n2853._22hr GSM2113589 Caenorhabditis elegans let-7(n2853)
#> GSM2113590 let.7.n2853._24hr GSM2113590 Caenorhabditis elegans let-7(n2853)
#> GSM2113591 let.7.n2853._26hr GSM2113591 Caenorhabditis elegans let-7(n2853)
#> time in development:ch1 age
#> GSM2113587 18 hours 18
#> GSM2113588 20 hours 20
#> GSM2113589 22 hours 22
#> GSM2113590 24 hours 24
#> GSM2113591 26 hours 26
pca_ds1 <- stats::prcomp(ds1$g, rank = 25)
nc <- sum(summary(pca_ds1)$importance[3,] < .999) + 1
nc
#> [1] 24
m_ds1 <- ge_im(X = ds1$g, p = ds1$p, formula = "X ~ s(age, bs = 'cr') + strain", nc = nc)
#> aCC aRE MSE aRMSE
#> 0.817616 0.008558632 0.004256494 0.06524181
# setup newdat
n.inter <- 100 # nb of new timepoints
newdat <- data.frame(
age = seq(min(ds1$p$age), max(ds1$p$age), l = n.inter),
strain = rep("N2", n.inter) # we want to predict as N2
)
head(newdat)
#> age strain
#> 1 18.00000 N2
#> 2 18.20202 N2
#> 3 18.40404 N2
#> 4 18.60606 N2
#> 5 18.80808 N2
#> 6 19.01010 N2
# predict
pred_m_ds1 <- predict(m_ds1, newdata = newdat)
pred_m_ds1_comp <- predict(m_ds1, newdata = newdat, as.c = TRUE)
We are using two Drosophila melanogaster embryonic development time-series datasets. The dataset used to build the reference was deliberately chosen with a very low time resolution to demonstrate the effectiveness of interpolating on gene expression data.
ds3. This is the dataset used to build the reference. (Data downloaded from fruitfly.org)
ds4. This is the dataset used for external validation. (Accession: GSE60471)
Code to generate ds3 and ds4:
Note: set the data_folder variable to an existing path on your system where you want to store the objects.
data_folder <- "../inst/extdata/"
requireNamespace("utils", quietly = T)
requireNamespace("GEOquery", quietly = T) # May need to be installed with bioconductor
requireNamespace("Biobase", quietly = T)
raw2rpkm <- function(X, gene.length, id.col = 1, l.col='length'){
# Compute RPKM from raw counts
if(!all(rownames(X)%in%gene.length[, id.col])){
stop("Some genes are missing length info !")
}
res <- sapply(colnames(X), function(samp){
pm <- sum(X[,samp])/1e6
rpkm <- (X[,samp]/pm)/(gene.length[match(rownames(X), gene.length[, id.col]), l.col]/1000)
})
rownames(res) <- rownames(X)
return(res)
}
requireNamespace("biomaRt", quietly = TRUE)
mart <- biomaRt::useMart("ensembl", dataset = "dmelanogaster_gene_ensembl")
droso_genes <- biomaRt::getBM(attributes = c("ensembl_gene_id",
"ensembl_transcript_id",
"external_gene_name",
"transcript_length"),
mart = mart)
colnames(droso_genes)[1:3] <- c("fb_id", "transcript_id", "gene_name")
rm(mart)
ds3
g_url_ds3 <- "ftp://ftp.fruitfly.org/pub/download/modencode_expression_scores/Celniker_Drosophila_Annotation_20120616_1428_allsamps_MEAN_gene_expression.csv.gz"
g_file_ds3 <- paste0(data_folder, "ds3.csv.gz")
utils::download.file(g_url_ds3, destfile = g_file_ds3)
X_ds3 <- read.table(gzfile(g_file_ds3), sep = ',', row.names = 1, h = T)
# convert gene ids to FBgn
X_ds3 <- RAPToR::format_ids(X_ds3, droso_genes, from = "gene_name", to = "fb_id")
# select embryo time series samples
X_ds3 <- X_ds3[,1:12]
P_ds3 <- data.frame(sname = colnames(X_ds3),
age = as.numeric(gsub("em(\\d+)\\.\\d+hr", "\\1", colnames(X_ds3))),
stringsAsFactors = FALSE)
ds3 <- list(g = X_ds3, p = P_ds3)
save(ds3, file = paste0(data_folder, "ds3.RData"), compress = "xz")
# cleanup
file.remove(g_file_ds3)
rm(g_url_ds3, g_file_ds3, X_ds3, P_ds3)
ds4
geo_ds4 <- "GSE60471"
g_url_ds4 <- GEOquery::getGEOSuppFiles(geo_ds4, makeDirectory = FALSE, fetch_files = FALSE)
g_file_ds4 <- paste0(data_folder, "ds4.txt.gz")
utils::download.file(url = as.character(g_url_ds4$url[3]), destfile = g_file_ds4)
X_ds4 <- read.table(gzfile(g_file_ds4), h = T, sep = '\t', as.is = T, row.names = 1, comment.char = "")
# filter poor quality samples
cm_ds4 <- RAPToR::cor.gene_expr(X_ds4, X_ds4)
f_ds4 <- which(0.6 > apply(cm_ds4, 1, quantile, probs = .99))
X_ds4 <- X_ds4[, -f_ds4]
# convert to rpkm & FBgn
X_ds4 <- RAPToR::format_ids(X_ds4, droso_genes, from = "fb_id", to = "fb_id")
X_ds4 <- raw2rpkm(X = X_ds4, gene.length = droso_genes, id.col = "fb_id", l.col = "transcript_length")
# pheno data
P_ds4 <- Biobase::pData(GEOquery::getGEO(geo_ds4, getGPL = F)[[1]])
# filter relevant fields/samples
P_ds4 <- P_ds4[, c("title", "geo_accession", "time (minutes cellularization stage):ch1")]
colnames(P_ds4)[3] <- "time"
P_ds4$title <- as.character(P_ds4$title)
P_ds4 <- P_ds4[P_ds4$title %in% colnames(X_ds4),]
X_ds4 <- X_ds4[, P_ds4$title]
# formatting
P_ds4$title <- gsub('Metazome_Drosophila_timecourse_', '', P_ds4$title)
colnames(X_ds4) <- P_ds4$title
P_ds4$age <- as.numeric(P_ds4$time) / 60
ds4 <- list(g = X_ds4, p = P_ds4)
save(ds4, file = paste0(data_folder, "ds4.RData"), compress = "xz")
# cleanup
file.remove(g_file_ds4)
rm(geo_ds4, g_url_ds4, g_file_ds4, X_ds4, P_ds4)
rm(droso_genes, raw2rpkm)
ds3$g <- limma::normalizeBetweenArrays(ds3$g, method = "quantile")
ds3$g <- log(ds3$g + 1)
ds4$g <- limma::normalizeBetweenArrays(ds4$g, method = "quantile")
ds4$g <- log(ds4$g + 1)
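limma::normalizeBetweenArrays(method = "quantile") forces every sample to share a common expression distribution. A minimal base-R sketch of the idea (not limma's actual implementation, which handles ties differently) :

```r
# Minimal sketch of quantile normalization (assumes a complete matrix, no NAs)
quantile_normalize <- function(m){
  ranks <- apply(m, 2, rank, ties.method = "first") # per-sample ranks
  ref <- rowMeans(apply(m, 2, sort))                # reference distribution
  out <- apply(ranks, 2, function(r) ref[r])        # map ranks to reference
  rownames(out) <- rownames(m)
  out
}
m <- cbind(s1 = c(5, 2, 3), s2 = c(4, 1, 9))
qn <- quantile_normalize(m)
# after normalization, both columns hold the same sorted values
```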
ds3$g[1:5, 1:5]
#> em0.2hr em2.4hr em4.6hr em6.8hr em8.10hr
#> FBgn0000003 3.9651391 4.2738527 3.3174101 4.5644242 4.6982706
#> FBgn0000008 1.2949845 0.9215699 0.6958672 0.6476801 0.7445991
#> FBgn0000014 0.5099295 0.9512866 1.3952815 1.8610406 1.8421960
#> FBgn0000015 0.2435639 0.6423988 1.0511912 1.1094674 1.0194280
#> FBgn0000017 1.7968429 2.0901351 1.3389420 1.5336183 1.6777064
head(ds3$p, n = 5)
#> sname age
#> 1 em0.2hr 0
#> 2 em2.4hr 2
#> 3 em4.6hr 4
#> 4 em6.8hr 6
#> 5 em8.10hr 8
pca_ds3 <- stats::prcomp(ds3$g, rank = 12)
nc <- sum(summary(pca_ds3)$importance[3,] < .999) + 1
nc
#> [1] 10
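The rule above keeps the first principal components that jointly explain 99.9% of the variance. The same logic on a toy matrix (illustrative only) :

```r
# nc = number of PCs needed to reach 99.9% cumulative explained variance
set.seed(1)
toy <- matrix(rnorm(200), nrow = 20, ncol = 10)
imp <- summary(stats::prcomp(toy))$importance # row 3 = cumulative proportion
nc_toy <- sum(imp[3, ] < .999) + 1
# with near-iid noise, all 10 components are needed
```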
m_ds3 <- ge_im(X = ds3$g, p = ds3$p, formula = "X ~ s(age, bs = 'cr')", nc = nc)
#> aCC aRE MSE aRMSE
#> 0.9094025 -0.04019692 0.005621654 0.07497769
# setup newdat
n.inter <- 100 # nb of new timepoints
newdat <- data.frame(
age = seq(min(ds3$p$age), max(ds3$p$age), l = n.inter)
)
head(newdat)
#> age
#> 1 0.0000000
#> 2 0.2222222
#> 3 0.4444444
#> 4 0.6666667
#> 5 0.8888889
#> 6 1.1111111
# predict
pred_m_ds3 <- predict(m_ds3, newdata = newdat)
pred_m_ds3_comp <- predict(m_ds3, newdata = newdat, as.c = TRUE)
# make a 'reference object'
r_ds3 <- list(interpGE = pred_m_ds3, time.series = newdat$age)
ae_ds3 <- ae(ds3$g, r_ds3$interpGE, r_ds3$time.series)
ae_ds4 <- ae(ds4$g, r_ds3$interpGE, r_ds3$time.series)
Notice that our validation dataset’s estimates appear quite noisy. However, if we look at the dynamics of the ds4 data, we see that the chronological age reported for the samples is erroneous.
pca_ds4 <- stats::prcomp(ds4$g, rank = 20)
This demonstrates how difficult it is to produce high-resolution time series, due to developmental asynchrony between the samples.
We are using two Danio rerio (zebrafish) embryonic development time series datasets. The dataset used to build the reference has uneven time sampling, as can often be the case.
We show a trick using ranks to build an adequate model in order to avoid interpolation bias.
The datasets are
ds5. This is the dataset used to build the reference. (Data accessible in the publication)
ds6. This is the dataset used for external validation. (Accession : GSE60619)
Code to generate ds5 and ds6 :
Note : set the data_folder variable to an existing path on your system where you want to store the objects.
data_folder <- "../inst/extdata/"
requireNamespace("utils", quietly = T)
requireNamespace("GEOquery", quietly = T) # May need to be installed with bioconductor
requireNamespace("Biobase", quietly = T)
raw2rpkm <- function(X, gene.length, id.col = 1, l.col='length'){
# Compute RPKM from raw counts
if(!all(rownames(X)%in%gene.length[, id.col])){
stop("Some genes are missing length info !")
}
res <- sapply(colnames(X), function(samp){
pm <- sum(X[, samp]) / 1e6 # reads per million for this sample
# RPKM = (counts / millions of reads) / (gene length in kb)
(X[, samp] / pm) / (gene.length[match(rownames(X), gene.length[, id.col]), l.col] / 1000)
})
rownames(res) <- rownames(X)
return(res)
}
requireNamespace("biomaRt", quietly = TRUE)
mart <- biomaRt::useMart("ensembl", dataset = "drerio_gene_ensembl")
zeb_genes <- biomaRt::getBM(attributes = c("ensembl_gene_id", "transcript_length"),
mart = mart)
rm(mart)
ds5
p_url_ds5 <- "http://europepmc.org/articles/PMC5690287/bin/elife-30860-supp1.tsv"
g_url_ds5 <- "http://europepmc.org/articles/PMC5690287/bin/elife-30860-supp2.tsv"
g_file_ds5 <- paste0(data_folder, "ds5.tsv")
utils::download.file(g_url_ds5, destfile = g_file_ds5)
X_ds5 <- read.table(g_file_ds5, h = T, sep ="\t", as.is = T, quote = "\"")
rownames(X_ds5) <- X_ds5$Gene.ID
X_ds5 <- X_ds5[,-(1:8)]
# convert to rpkm & ensembl_id
X_ds5 <- RAPToR::format_ids(X_ds5, zeb_genes, from = "ensembl_gene_id", to = "ensembl_gene_id")
X_ds5 <- raw2rpkm(X = X_ds5, gene.length = zeb_genes, id.col = "ensembl_gene_id", l.col = "transcript_length")
# pheno data
P_ds5 <- read.table(p_url_ds5, h = T, sep = "\t", as.is = T)
P_ds5 <- P_ds5[P_ds5$sequencing == "RNASeq", c("sample", "accession_number", "stage", "stageName", "sampleName")]
# timings of stages from the White et al. eLife (2017) publication of the data.
# time given in hours post-fertilization
timepoints <- data.frame(stage = unique(P_ds5$stageName),
hours_pf = c(0, .75, 2.25, 3, 4.3, 5.25, 6, 8, 10.3,
16, 19, 24, 30, 36, 48, 72, 96, 120),
stringsAsFactors = F, row.names = "stage")
P_ds5$age <- timepoints[P_ds5$stageName, "hours_pf"]
P_ds5$batch <- factor(gsub(".*-(\\d)$", "\\1", P_ds5$sampleName))
X_ds5 <- X_ds5[, P_ds5$sample]
ds5 <- list(g = X_ds5, p = P_ds5)
save(ds5, file = paste0(data_folder, "ds5.RData"), compress = "xz")
# cleanup
file.remove(g_file_ds5)
rm(p_url_ds5, g_url_ds5, g_file_ds5, X_ds5, P_ds5, timepoints)
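The stage-to-hours mapping above relies on indexing a data.frame by row names, which recycles rows for repeated stage names. A toy illustration (hypothetical stages) :

```r
# Row-name lookup recycles rows when the same stage appears several times
tp <- data.frame(hours_pf = c(0, 0.75, 2.25),
                 row.names = c("1-cell", "2-cell", "128-cell"))
tp[c("2-cell", "1-cell", "2-cell"), "hours_pf"]
# 0.75 0.00 0.75
```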
ds6
geo_ds6 <- "GSE60619"
g_url_ds6 <- GEOquery::getGEOSuppFiles(geo_ds6, makeDirectory = FALSE, fetch_files = FALSE)
g_file_ds6 <- paste0(data_folder, "ds6.txt.gz")
utils::download.file(url = as.character(g_url_ds6$url[2]), destfile = g_file_ds6)
X_ds6 <- read.table(gzfile(g_file_ds6), h = T, sep = '\t', as.is = T, row.names = 1, comment.char = "")
# convert to rpkm & ensembl id
X_ds6 <- RAPToR::format_ids(X_ds6, zeb_genes, from = "ensembl_gene_id", to = "ensembl_gene_id")
X_ds6 <- raw2rpkm(X = X_ds6, gene.length = zeb_genes, id.col = "ensembl_gene_id", l.col = "transcript_length")
# pheno data
P_ds6 <- Biobase::pData(GEOquery::getGEO(geo_ds6, getGPL = F)[[1]])
# filter relevant fields/samples
P_ds6 <- P_ds6[, c("title", "geo_accession", "time (min after fertilization):ch1")]
colnames(P_ds6)[3] <- "time"
P_ds6$title <- as.character(P_ds6$title)
P_ds6 <- P_ds6[P_ds6$title %in% colnames(X_ds6),]
X_ds6 <- X_ds6[, P_ds6$title]
# formatting
P_ds6$title <- gsub('Metazome_ZF_timecourse_', '', P_ds6$title)
colnames(X_ds6) <- P_ds6$title
P_ds6$age <- as.numeric(P_ds6$time) / 60
ds6 <- list(g = X_ds6, p = P_ds6)
save(ds6, file = paste0(data_folder, "ds6.RData"), compress = "xz")
# cleanup
file.remove(g_file_ds6)
rm(geo_ds6, g_url_ds6, g_file_ds6, X_ds6, P_ds6)
rm(zeb_genes, raw2rpkm)
ds5$g <- limma::normalizeBetweenArrays(ds5$g, method = "quantile")
ds5$g <- log(ds5$g + 1)
ds6$g <- limma::normalizeBetweenArrays(ds6$g, method = "quantile")
ds6$g <- log(ds6$g + 1)
ds5$g[1:5, 1:5]
#> zmp_ph133_B zmp_ph133_D zmp_ph133_E zmp_ph133_F zmp_ph133_G
#> ENSDARG00000000001 1.992991 1.827910 1.7427071 1.839980 1.7305791
#> ENSDARG00000000002 1.013519 1.049910 0.7850242 1.046494 0.8564017
#> ENSDARG00000000018 2.247738 2.322763 2.0246236 2.159045 2.3384767
#> ENSDARG00000000019 4.198081 4.295515 4.3738184 4.298936 4.3578940
#> ENSDARG00000000068 4.172288 4.226427 4.0332980 4.059632 3.9252796
head(ds5$p, n = 5)
#> sample accession_number stage stageName sampleName age batch
#> 1 zmp_ph133_B ERS1079239 Zygote:1-cell 1-cell 1-cell-1 0 1
#> 2 zmp_ph133_D ERS1079240 Zygote:1-cell 1-cell 1-cell-2 0 2
#> 3 zmp_ph133_E ERS1079241 Zygote:1-cell 1-cell 1-cell-3 0 3
#> 4 zmp_ph133_F ERS1079243 Zygote:1-cell 1-cell 1-cell-4 0 4
#> 5 zmp_ph133_G ERS1079244 Zygote:1-cell 1-cell 1-cell-5 0 5
pca_ds5 <- stats::prcomp(ds5$g, rank = 25)
Notice how the sampling gets sparser towards the end of the time series, while the dynamics become “wider”. Fitting splines on the components directly against age here would lead to a poor fit of the earlier timepoints.
To bypass this issue, we can use ranks instead of the timepoints.
# using data.table's rank function to get the "dense" tie method
ds5$p$rank <- data.table::frank(ds5$p$age, ties.method = "dense")
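For reference, “dense” ranking gives replicates of the same timepoint the same rank, with consecutive ranks across timepoints ; in base R it amounts to matching against the sorted unique values :

```r
# base-R equivalent of data.table::frank(x, ties.method = "dense") (toy ages)
x <- c(0, 0, 0.75, 2.25, 2.25, 3)
match(x, sort(unique(x)))
# 1 1 2 3 3 4
```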
These dynamics will be fitted much more cleanly. To predict the data on a uniform time scale, we then pick values on the rank scale that a simple linear warp maps back to a uniform series on the age scale, as done below.
nc <- sum(summary(pca_ds5)$importance[3,] < .999) + 1
nc
#> [1] 77
m_ds5 <- ge_im(X = ds5$g, p = ds5$p, formula = "X ~ s(rank, bs = 'ds') + batch", nc = nc)
#> aCC aRE MSE aRMSE
#> 0.8772099 0.2281205 0.01266176 0.1125245
We’ll be using a linear warp to get a uniform time series.
linwarp <- function(x, xyt, xc = 1, yc = 2){
# Computes a linear interpolation of input x to y value
# x = values of x to convert to y
# xyt = table with known sets of x/y
# xc, yc = column indices of x and y in xyt
if(min(x) < min(xyt[,xc]) | max(x) > max(xyt[,xc]))
stop("Some values of x are outside of the known x/y sets")
# set up y to x conversion table
xyt <- xyt[!duplicated.default(xyt[,xc]),]
xyt <- xyt[order(xyt[,xc]),]
xyt[,"dify"] <- c(0, diff(xyt[,yc]))
xyt[,"difx"] <- c(1, diff(xyt[,xc]))
xyt <- rbind(xyt[1,], xyt) # double 1st line for edge case
xout <- unlist(sapply(x, function(xi){
rsup <- which(xyt[-1,xc] >= xi)[1] + 1
xyt[rsup-1, yc] + (xi - xyt[rsup-1, xc])/xyt[rsup, "difx"] * xyt[rsup, "dify"]
}))
return(xout)
}
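On deduplicated input, linwarp() boils down to piecewise-linear interpolation, much like base R's stats::approx() ; a toy check of the expected behaviour (hypothetical x/y table) :

```r
# Piecewise-linear interpolation on a toy x/y table, as linwarp() performs
xyt_toy <- data.frame(x = c(1, 2, 3), y = c(0, 2, 10))
stats::approx(x = xyt_toy$x, y = xyt_toy$y, xout = c(1, 1.5, 2, 3))$y
# 0 1 2 10
```

The custom helper mainly makes the handling of the x/y conversion table (deduplication, ordering) explicit.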
# setup newdat
n.inter <- 200
newdat <- data.frame(age = seq(min(ds5$p[, "age"]), max(ds5$p[, "age"]), l = n.inter),
batch = rep("1", n.inter)) # predict as batch 1
# apply linwarp
newdat$rank <- linwarp(newdat$age, xyt = ds5$p[, c("age", "rank")], xc = 1, yc = 2)
head(newdat)
#> age batch rank
#> 1 0.0000000 1 1.000000
#> 2 0.6030151 1 1.804020
#> 3 1.2060302 1 2.304020
#> 4 1.8090452 1 2.706030
#> 5 2.4120603 1 3.216080
#> 6 3.0150754 1 4.011596
# predict
pred_m_ds5 <- predict(m_ds5, newdata = newdat)
pred_m_ds5_comp <- predict(m_ds5, newdata = newdat, as.c = TRUE)
We want a uniform series on the age scale, but must input values on the rank scale to the model, which is why we use linwarp(). To give a sense of what the function did, we can plot the ranks against the age.
On the rank scale :
Back on the age scale :
# make a 'reference object'
r_ds5 <- list(interpGE = pred_m_ds5, time.series = newdat$age)
ae_ds5 <- ae(ds5$g, r_ds5$interpGE, r_ds5$time.series)
ae_ds6 <- ae(ds6$g, r_ds5$interpGE, r_ds5$time.series)
For comparison, we’ll build the same model without accounting for the uneven sampling (i.e. not the optimal model !).
m_ds5_norank <- ge_im(X = ds5$g, p = ds5$p, formula = "X ~ s(age, bs = 'ds') + batch", nc = nc)
#> aCC aRE MSE aRMSE
#> 0.8477562 0.2153231 0.01803523 0.1342953
pred_m_ds5_norank <- predict(m_ds5_norank, newdata = newdat)
pred_m_ds5_comp_norank <- predict(m_ds5_norank, newdata = newdat, as.c = TRUE)
Let’s plot the components on the rank and age scales, as before.
Back on the age scale :
We can already see that the model struggles to predict the dynamics accurately : for example, the fit picks up noise on PC6 and flattens the dynamics of PC7.
This has consequences on the estimates of the validation dataset, as you’ll see when we stage the samples.
# make a 'reference object'
r_ds5_norank <- list(interpGE = pred_m_ds5_norank, time.series = newdat$age)
ae_ds5_norank <- ae(ds5$g, r_ds5_norank$interpGE, r_ds5_norank$time.series)
ae_ds6_norank <- ae(ds6$g, r_ds5_norank$interpGE, r_ds5_norank$time.series)
The “gaps” or “steps” you can see in the validation dataset’s estimates are due to interpolation bias : the picked-up noise and flattened dynamics mentioned above. Essentially, the model errors create local “unlikely/unrealistic” gene expression zones, which do not correlate well with samples of the corresponding age. These zones most often fall in the gaps between timepoints of the dataset used to build the reference, so the estimates of its own samples are unaffected.
While we’ve used a suboptimal model here (which clearly displays the model errors on the component plots), interpolation bias can be much more subtle. In such cases, it can only be assessed using an external dataset.
Aeschimann, Florian, Pooja Kumari, Hrishikesh Bartake, Dimos Gaidatzis, Lan Xu, Rafal Ciosk, and Helge Großhans. 2017. “LIN41 Post-Transcriptionally Silences mRNAs by Two Distinct and Position-Dependent Mechanisms.” Molecular Cell 65 (3). Elsevier: 476–89.
Alter, Orly, Patrick O Brown, and David Botstein. 2000. “Singular Value Decomposition for Genome-Wide Expression Data Processing and Modeling.” Proceedings of the National Academy of Sciences 97 (18). National Acad Sciences: 10101–6.
Graveley, Brenton R, Angela N Brooks, Joseph W Carlson, Michael O Duff, Jane M Landolin, Li Yang, Carlo G Artieri, et al. 2011. “The Developmental Transcriptome of Drosophila Melanogaster.” Nature 471 (7339). Nature Publishing Group: 473.
Hendriks, Gert-Jan, Dimos Gaidatzis, Florian Aeschimann, and Helge Großhans. 2014. “Extensive Oscillatory Gene Expression During C. elegans Larval Development.” Molecular Cell 53 (3). Elsevier: 380–92.
Levin, Michal, Leon Anavy, Alison G Cole, Eitan Winter, Natalia Mostov, Sally Khair, Naftalie Senderovich, et al. 2016. “The Mid-Developmental Transition and the Evolution of Animal Body Plans.” Nature 531 (7596). Nature Publishing Group: 637.
White, Richard J, John E Collins, Ian M Sealy, Neha Wali, Christopher M Dooley, Zsofia Digby, Derek L Stemple, et al. 2017. “A High-Resolution mRNA Expression Time Course of Embryonic Development in Zebrafish.” Elife 6. eLife Sciences Publications Limited: e30860.